Smoothing fine-grained PCFG lexicons

Authors

  • Tejaswini Deoskar
  • Mats Rooth
  • Khalil Sima'an
Abstract

We present an approach for smoothing treebank-PCFG lexicons by interpolating treebank lexical parameter estimates with estimates obtained from unannotated data via the inside-outside algorithm. The PCFG has complex lexical categories, making relative-frequency estimates from a treebank very sparse. This kind of smoothing for complex lexical categories results in improved parsing performance, with a particular advantage in identifying obligatory arguments subcategorized by verbs unseen in the treebank.
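
The interpolation idea can be illustrated with a minimal sketch. It assumes plain linear interpolation with a single fixed weight, which the abstract does not confirm as the authors' exact scheme. The function names, the weight `lam`, the category label `VB.np`, and all counts are hypothetical, and the "expected counts" merely stand in for the output of an inside-outside run over unannotated text.

```python
from collections import defaultdict


def relative_frequencies(counts):
    """Convert {(tag, word): count} into estimates of P(word | tag)."""
    totals = defaultdict(float)
    for (tag, _word), c in counts.items():
        totals[tag] += c
    return {(tag, word): c / totals[tag] for (tag, word), c in counts.items()}


def interpolate(p_treebank, p_unannotated, lam=0.5):
    """Linearly interpolate two lexical distributions (assumed scheme).

    lam weights the treebank estimate; (1 - lam) weights the estimate
    derived from unannotated data. Entries missing from one source
    contribute probability 0 from that source.
    """
    keys = set(p_treebank) | set(p_unannotated)
    return {
        k: lam * p_treebank.get(k, 0.0) + (1.0 - lam) * p_unannotated.get(k, 0.0)
        for k in keys
    }


# Hypothetical treebank counts for one fine-grained verb category.
treebank_counts = {("VB.np", "eat"): 3, ("VB.np", "see"): 1}

# Hypothetical expected counts, standing in for inside-outside output.
# "devour" never occurs in the treebank.
expected_counts = {
    ("VB.np", "eat"): 2.0,
    ("VB.np", "see"): 1.0,
    ("VB.np", "devour"): 1.0,
}

smoothed = interpolate(
    relative_frequencies(treebank_counts),
    relative_frequencies(expected_counts),
    lam=0.5,
)
```

In this toy setup the verb missing from the treebank receives non-zero probability under the fine-grained category, which is the effect the abstract attributes to smoothing. The result remains a proper distribution here because both sources cover the same category.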

Similar resources

Ckylark: A More Robust PCFG-LA Parser

This paper describes Ckylark, a PCFG-LA style phrase structure parser that is more robust than other parsers in the genre. PCFG-LA parsers are known to achieve highly competitive performance, but sometimes the parsing process fails completely, and no parses can be generated. Ckylark introduces three new techniques that prevent possible causes for parsing failure: outputting intermediate results...

Corpus Induction of Lexicons for Treebank PCFGs by Inside-Outside Estimation and Frequency Transformations

We describe procedures which pool lexical information from a treebank with frequency information estimated from an unannotated corpus with the inside-outside algorithm. PCFG parameters for non-lexical productions are obtained purely from the treebank. The procedures produce substantial improvements (up to 20.34%) on the task of determining valences of tokens of novel verbs, relative to a smoothed...

Building a fine-grained subjectivity lexicon from a web corpus

In this paper we propose a method to build fine-grained subjectivity lexicons including nouns, verbs and adjectives. The method, which is applied for Dutch, is based on the comparison of word frequencies of three corpora: Wikipedia, News and News comments. Comparison of the corpora is carried out with two measures: log-likelihood ratio and a percentage difference calculation. The first step of ...

Appropriately Handled Prosodic Breaks Help PCFG Parsing

This paper investigates using prosodic information in the form of ToBI break indexes for parsing spontaneous speech. We revisit two previously studied approaches, one that hurt parsing performance and one that achieved minor improvements, and propose a new method that aims to better integrate prosodic breaks into parsing. Although these approaches can improve the performance of basic probabilis...

Korean Twitter Emotion Classification Using Automatically Built Emotion Lexicons and Fine-Grained Features

In recent years many people have begun to express their thoughts and opinions on Twitter. Naturally, Twitter has become an effective source to investigate people’s emotions for numerous applications. Classifying only positive and negative tweets has been exploited in depth, whereas analyzing finer emotions is still a difficult task. More elaborate emotion lexicons should be developed to deal wi...

Publication date: 2009